An Interactive Framework for Data Cleaning

Authors

  • Vijayshankar Raman
  • Joseph M. Hellerstein
Abstract

Cleaning organizational data of discrepancies in structure and content is important for data warehousing and Enterprise Data Integration (EDI). Current commercial solutions for data cleaning involve many iterations of time-consuming “data quality” analysis to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation programs. We present an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. In the background, the system incrementally searches for discrepancies on the latest transformed version of data, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays. Balancing the goals of power, ease of specification, and interactive application, we choose a set of transforms that can be used for transformations within data records as well as for higher-order transformations. We also present initial work on optimizing a sequence of transforms.
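
The interaction described in the abstract (adding or undoing transforms, previewing their effect on visible records, and flagging discrepancies on the transformed data) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `CleaningSession`, `view`, `discrepancies`, and `split_name` are hypothetical and are not taken from the paper, and the discrepancy check runs synchronously rather than incrementally in the background as the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]
Transform = Callable[[Record], Record]


@dataclass
class CleaningSession:
    """Hypothetical session: raw records plus an undoable list of transforms."""
    records: List[Record]
    transforms: List[Transform] = field(default_factory=list)

    def add(self, transform: Transform) -> None:
        # Adding a transform never rewrites the raw records.
        self.transforms.append(transform)

    def undo(self) -> None:
        # Undo simply drops the most recently added transform.
        if self.transforms:
            self.transforms.pop()

    def view(self, start: int, count: int) -> List[Record]:
        """Apply the current transforms only to the requested window of records."""
        visible = []
        for record in self.records[start:start + count]:
            current = dict(record)
            for transform in self.transforms:
                current = transform(current)
            visible.append(current)
        return visible

    def discrepancies(self, column: str, is_valid: Callable[[str], bool]) -> List[int]:
        """Return indices of records whose transformed value fails the check."""
        flagged = []
        for index in range(len(self.records)):
            value = self.view(index, 1)[0].get(column, "")
            if not is_valid(value):
                flagged.append(index)
        return flagged


def split_name(record: Record) -> Record:
    """Example transform (assumed): split 'Last, First' into two columns."""
    out = dict(record)
    name = out.pop("name", "")
    last, _, first = name.partition(",")
    out["last"] = last.strip()
    out["first"] = first.strip()
    return out


session = CleaningSession([{"name": "Raman, Vijayshankar"}, {"name": "Hellerstein"}])
session.add(split_name)
preview = session.view(0, 2)                                  # effect on visible rows
flagged = session.discrepancies("first", lambda v: v != "")   # rows missing a first name
session.undo()
restored = session.view(0, 2)                                 # raw records again
```

Because the raw records are never modified, undoing a transform is just a matter of removing it from the list, and a preview costs only as much as the number of rows requested.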

Similar resources

Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation

Real world data often has discrepancies in structure and content. Traditional methods for "cleaning" the data involve many iterations of time-consuming "data quality" analysis to find discrepancies, and long-running transformations to fix them. This process requires users to endure long waits and often write complex transformation programs. We present an interactive framework for data cleaning that...

Tabular Data Cleaning and Linked Data Generation with Grafterizer

Over the past several years the amount of published open data has increased significantly. The majority of this is tabular data, that requires powerful and flexible approaches for data cleaning and preparation in order to convert it into Linked Data. This paper introduces Grafterizer – a software framework developed to support data workers and data developers in the process of converting raw ta...

TAILOR: A Record Linkage Tool Box

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, ...

Research and Realization of the Extensible Data Cleaning Framework

This paper proposes the idea of establishing an extensible data cleaning framework which is based on the key technology of data cleaning, and the framework includes open rules library and algorithms library. This paper gives the descriptions of model principle and working process of the extensible data cleaning framework, and the validity of the framework is verified by experiment. When the dat...

A Framework for Data Cleaning in Data Warehouses

It is a persistent challenge to achieve a high quality of data in data warehouses. Data cleaning is a crucial task for such a challenge. To deal with this challenge, a set of methods and tools has been developed. However, there are still at least two questions needed to be answered: How to improve the efficiency while performing data cleaning? How to improve the degree of automation when perfor...

Publication date: 2000